Searching for a tourist destination is a common step when planning a trip.
This search is usually done with search engines and by reading articles on
the internet. However, it is time-consuming, because obtaining relevant
information requires reading several articles. Named entity recognition (NER)
can detect named entities in a text to help users find the desired
information. This study aims to create a NER model that helps detect tourist
attractions in an article. The articles used for the dataset are English
articles obtained from the internet. We built our NER model using
bidirectional long short-term memory (BiLSTM) and conditional random fields
(CRF), with Word2Vec as a

1. INTRODUCTION 

Tourism growth happens almost every year. In 2019, worldwide international tourist arrivals 
reached 1.5 billion, increasing by 4% from the previous year [1]. With the advent of the web 2.0 age, internet 
users frequently share their travel experiences via websites [2]. A study conducted by Google Travel found 
that 74% of travelers plan their trips via the internet [3]. The search for tourist destinations is one of the steps 
that is generally carried out when planning a trip. Most people commonly search for information regarding 
tourism destinations through reviews, websites, and articles on the internet [4], [5]. However, searching, 
selecting, and reading details of each piece of information through travel guidebooks or portal sites is time- 
consuming [6], [7]. The time-consuming issue of getting travel information from texts can be solved by 
applying information extraction. 

The named entity recognition (NER) task can be applied to extract information from texts in the 
natural language processing area. NER can be defined as a task of extracting entities from text documents, 
such as a person’s name, location, or organization [8]-[10]. NER approaches include rule-based methods, 
machine learning methods such as the hidden Markov model (HMM), maximum entropy, decision trees, 
support vector machines, and conditional random fields (CRF), and hybrid approaches [11]. Other methods 
include the recurrent neural network (RNN) and its variant, long short-term memory (LSTM), which has 
been used successfully in various sequence prediction problems, such as NER, language modeling, and 
speech recognition [12]. NER has also been studied in multiple fields. In the geological domain, the 
authors of [13] developed a NER system for geological text using the CRF method and the 
IITKGP-GEOCORP dataset, which was built from article collections and scientific reports containing 
geology-related information in India. In the biomedical domain, the BiLSTM-CRF (bidirectional long 
short-term memory-conditional random fields) model was used to recognize drug names in biomedical 
literature [14]. The authors used the concatenation of word embeddings and character-level embeddings as 
the input vector. Their experiments obtained an F1-Score of 92.04% and achieved performance comparable 
with the state-of-the-art results on two datasets. Word embedding was used as a feature by [15] in named 
entity recognition research and produced a better result than the baseline without word embedding. 

A study by [16] proposed BiLSTM-CRF for their NER system using three datasets: Penn Treebank 
POS tagging, Conference on Computational Natural Language Learning (CoNLL) 2000 chunking, and CoNLL 
2003 named entity tagging. The research revealed that the BiLSTM-CRF model could efficiently use input 
features from the past and future because of the two-way LSTM component. Moreover, the CRF layer can 
provide the label information at the sentence level. From several scenarios used in that research, the 
BiLSTM-CRF model yielded the best results in almost all datasets. The merging of BiLSTM and CRF layers 
was also carried out in [17], and the results showed that such merging could solve the problem of the inability 
to handle strong dependencies of tags in a sequence. A Chinese dataset and BiLSTM-CRF model were used 
in [18]. It was found that using a dictionary produced better performance than using only the BiLSTM-CRF 
model. The work in [19], which used the BiLSTM-CRF model along with pre-trained word embeddings, 
character embeddings, and dictionary information, succeeded in improving the performance of the 
Disease-NER system. The pre-trained word embeddings were trained with skip-gram on a combination of 
domain-specific text (PMC and PubMed texts) and generic text (English Wikipedia dump). The BiLSTM-CRF 
model can also be combined with bidirectional encoder representations from transformers (BERT) [20]. 
The combination of the three methods is called BBLC. The authors of [20] compared the performance of 
the BBLC model with the BiLSTM-CRF model using the same dataset. The BBLC obtained a higher F1-Score 
than the BiLSTM-CRF on some entities, such as location, organization, and thing. On the other hand, the 
BiLSTM-CRF model achieved a higher F1-Score than BBLC on the time entity. 

Furthermore, NER can extract meaningful information from tourism websites by identifying the 
named entities. In the tourism domain, identified entities can be the names of tourist attractions, places of 
lodging, facilities, and locations. Identifying related entities is expected to make it easier for potential tourists 
to find tourist destinations via the internet. However, many NER studies in the tourism domain did not focus 
on categorizing the characteristics of tourist attractions. We argue that classifying the characteristics of the 
tourist attraction, such as natural, heritage, or purposefully built, is essential to help users make decisions for 
future utilization of our NER system. Thus, in this study, we present a NER system to aid tourists in finding 
tourist destinations from articles by extracting tourist attractions and classifying each token into one of 
four categories: “natural”, “heritage”, “purposefully built” (artificial), and “outside”. This study proposes a 
combination of Word2vec and BiLSTM-CRF approaches to build the NER system for tourism. The 
implementation of Word2vec in our research is inspired by [21]. Additionally, we compare the performance 
of two Word2vec algorithms, skip-gram and continuous bag-of-words (CBOW). 

A previous study used different tags/labels for tourism NER. Saputro et al. [22] used five 
labels, namely “nature”, “place”, “city”, “region”, and “negative”, as the named entities. The proposed system 
achieved an accuracy of 70.43% and an F-Score of 69%. However, some of the labels used in that study are 
not specific enough to identify the characteristics of the tourist attractions. For example, when the goal is to 
extract the name of a tourist attraction and its characteristics (whether it is natural, heritage, or 
purposefully built), the labels “city” and “region” are too broad. Another study proposed a corpus for tourism 
NER in the Mongolian language [23]. Thus, although these studies are similar to ours, a direct comparison is 
impossible because the labels and the language are different. 

The rest of this paper is structured as follows. Section 2 describes the methodology of our study. In 
section 3, the result and discussion of this study are presented. Finally, section 4 provides the conclusion and 
future work. 


2. METHOD 

This section describes the methodology of our study. Our work consists of six steps: data retrieval, 
pre-processing, data labeling, feature extraction, classification, and evaluation. Each step is explained in 
the following subsections. 


2.1. Data retrieval 

The data for this study was gathered from English tourism articles via web scraping. We 
collected the articles in two ways: keyword searches and scraping articles directly from 
predetermined websites. The keyword searches used the following keywords: top tourist cities, best 
places to visit, top world heritage sites, world heritage list, best destinations for nature lovers, and best 
natural tourist attractions. Scraping was done with Python's Newspaper module, which can extract and 
parse text from website articles. Each article was processed independently and then saved 
in a file with the *.txt extension. We gathered our dataset from 24 websites and obtained 92 articles 
containing 8,500 sentences, 17,137 unique words, and 183,507 tokens. 


2.2. Pre-processing 

Pre-processing aims to eliminate meaningless characters and preserve the remaining 
valuable words [24]. The pre-processing steps carried out in this research were URL removal, 
emoticon removal, and tokenization. URLs and emoticons were removed because they contribute little to 
recognizing a named entity. Tokenization divides the sentences into smaller units 
called tokens. 
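
The three pre-processing steps can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: the regular expressions `URL_PATTERN`, `EMOTICON_PATTERN`, and `TOKEN_PATTERN`, the function `preprocess`, and the sample sentence are all assumptions, since the paper does not specify how URLs and emoticons were matched or which tokenizer was used.

```python
import re

# Assumed patterns: the paper does not say how URLs or emoticons were matched.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
# Common emoji code-point ranges plus a few ASCII emoticons such as :) and ;-(
EMOTICON_PATTERN = re.compile(
    r"[\U0001F300-\U0001FAFF\u2600-\u27BF]|[:;=]-?[()DPp]"
)
# A word (optionally with inner hyphens/apostrophes) or one punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def preprocess(sentence: str) -> list[str]:
    """Remove URLs and emoticons, then split the sentence into tokens."""
    sentence = URL_PATTERN.sub(" ", sentence)
    sentence = EMOTICON_PATTERN.sub(" ", sentence)
    return TOKEN_PATTERN.findall(sentence)


tokens = preprocess("Visit Edo Castle :) details at https://example.com/tokyo now!")
print(tokens)
```

URLs are removed before emoticons so that the colon in `https://` is never mistaken for part of an emoticon.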


2.3. Data labelling 

We manually labeled the tokens with four categories: natural, heritage, purposefully built 
(artificial), and outside. We adopted the BIO tagging format in the labeling process, in which a token is 
tagged as B-label if it begins a named entity, I-label if it is inside a named entity but is not its first token, 
and O otherwise [25]. In this study, the B- prefix marks the first word of a tourist attraction's 
name, while the I- prefix marks the second through the last word of the name. The natural category covers 
natural attractions that are open to the public and provide natural views, such as waterfalls, mountains, 
caves, rivers, and glaciers. The heritage category covers attractions that are ancient, historic, and often 
cultural, or that tend to represent culture and heritage, as well as places of worship. Ruins, monuments, 
temples, forts, castles, mosques, and cathedrals are categorized as heritage structures. Tourist attractions 
that were purposefully built to attract visitors, such as museums, markets, and amusement parks, are 
classified as purposefully built. The final category, outside, is for any word that is not considered part of 
a tourist attraction. 
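
As an illustration of the BIO scheme described above, the following sketch converts entity spans into token-level tags. The function `bio_tags`, the span format `(start, end, category)` with an exclusive `end`, and the example sentence are assumptions made for this example; the study itself labeled the data manually.

```python
def bio_tags(tokens, entities):
    """Convert (start, end, category) token spans into BIO tags; `end` is exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, category in entities:
        tags[start] = f"B-{category}"          # first word of the attraction name
        for i in range(start + 1, end):
            tags[i] = f"I-{category}"          # second through last word
    return tags


tokens = ["Edo", "Castle", "was", "home", "to", "samurai", "warriors"]
entities = [(0, 2, "HERITAGE")]                # "Edo Castle" is a heritage attraction
tags = bio_tags(tokens, entities)
for token, tag in zip(tokens, tags):
    print(f"{token}: {tag}")
```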

The number of tokens for each label is shown in Table 1. Our dataset was imbalanced, since the O 
label dominated with 171,728 tokens. The I-PURPOSE and I-HERITAGE labels had 2,874 
and 2,051 tokens, respectively. The counts for B-NATURAL, I-NATURAL, and B-PURPOSE were almost 
the same. The B-HERITAGE label had the fewest tokens, with 1,401. 


Table 1. Number of tokens for each label 


Label Number of Tokens 
O 171,728 
B-NATURAL 1,789 
I-NATURAL 1,853 
B-HERITAGE 1,401 
I-HERITAGE 2,051 
B-PURPOSE 1,811 
I-PURPOSE 2,874 


2.4. Feature extraction using word embeddings 

This study used Word2vec word embeddings as the feature extraction method. Word2vec 
has two training algorithms, continuous bag-of-words (CBOW) and skip-gram. In our experiments, we 
compared CBOW and skip-gram to obtain the best result. The CBOW algorithm predicts a 
target (middle) word from the combined representation of its surrounding context words, whereas the 
skip-gram algorithm learns a word vector representation that predicts the context words surrounding a 
given word. 
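
The difference between the two algorithms is easiest to see in the training examples each one builds from a sentence. The sketch below only generates these examples and does not train any vectors; the helper names `context_window`, `cbow_examples`, and `skipgram_examples`, the window size, and the sample sentence are assumptions for illustration.

```python
def context_window(tokens, index, window):
    """Return up to `window` tokens on each side of position `index`."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW: (list of context words) -> target word."""
    return [(context_window(tokens, i, window), target)
            for i, target in enumerate(tokens)]


def skipgram_examples(tokens, window=2):
    """Skip-gram: target word -> one context word per pair."""
    return [(target, context)
            for i, target in enumerate(tokens)
            for context in context_window(tokens, i, window)]


sentence = ["visit", "the", "imperial", "palace", "today"]
cbow = cbow_examples(sentence, window=1)
skipgram = skipgram_examples(sentence, window=1)
print(cbow[2])       # context of "imperial" and the target itself
print(skipgram[:3])  # first few (target, context) pairs
```

CBOW produces one example per position, while skip-gram produces one pair per context word, so the same sentence yields more skip-gram training pairs.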

Both models require little training time and can be applied to a large corpus [26]. 
Word2Vec is trained on the textual input and produces a dictionary that maps each word to its vector. The 
dictionary contains the same number of words as the dataset's unique words. In 
the following step, the pre-trained vectors in this dictionary are used as the weights of the 
embedding layer. 
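
A minimal sketch of turning such a dictionary into embedding-layer weights is given below. It is an assumption-laden illustration: the function `build_embedding_matrix`, the reservation of row 0 for padding, the alphabetical index order, and the three-dimensional toy vectors are all hypothetical. In the study, the vocabulary has 17,137 unique words, and Figure 1 indicates 100-dimensional embeddings.

```python
def build_embedding_matrix(word_vectors, dim):
    """Return (word_index, matrix); row 0 is an all-zero row reserved for padding."""
    word_index = {word: i for i, word in enumerate(sorted(word_vectors), start=1)}
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, row in word_index.items():
        vector = word_vectors[word]
        if len(vector) != dim:
            raise ValueError(f"vector for {word!r} has length {len(vector)}, expected {dim}")
        matrix[row] = list(vector)
    return word_index, matrix


# Toy vectors standing in for the trained Word2vec dictionary (values are made up).
toy_vectors = {
    "castle": [0.1, 0.3, -0.2],
    "temple": [0.0, 0.4, 0.5],
    "park": [-0.3, 0.2, 0.1],
}
word_index, matrix = build_embedding_matrix(toy_vectors, dim=3)
sentence_ids = [word_index[word] for word in ["temple", "park"]]
print(word_index, sentence_ids)
```

The integer ids are what the input layer would receive, and the matrix rows are what the embedding layer would look up.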


2.5. Classification 

We used StratifiedKFold from Scikit-Learn to divide our data into ten subsets. In each fold, one 
subset was used as the test data, and the other nine subsets acted as the train data. The samples in this 
study are sentences, so the total number of samples was 8,500. After splitting the data, each fold had 7,650 
train samples and 850 test samples. During training, 20% of the train data was held out as validation data, 
leaving 6,120 samples for training and 1,530 samples for validation. The average 
number of tokens for each label in each fold is shown in Table 2 for the training data and in Table 3 for 
the test data. The large number of O labels results from post-padding each 
sample with O as the padding value. 


Table 2. The average number of tokens for each label in each fold (training data) 


Label Average Number of Tokens 
O 678,186 
B-NATURAL 1,602 
I-NATURAL 1,659 
B-HERITAGE 1,233 
I-HERITAGE 1,803 
B-PURPOSE 1,554 
I-PURPOSE 2,459 


Table 3. The average number of tokens for each label in each fold (test data) 


Label Average Number of Tokens 
O 75,354 
B-NATURAL 178 
I-NATURAL 184 
B-HERITAGE 137 
I-HERITAGE 200 
B-PURPOSE 172 
I-PURPOSE 273 
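
The split sizes and the effect of padding can be checked with simple arithmetic. The sketch below assumes that every sample is post-padded (or truncated) to 90 tokens, the sequence length that appears in the layer shapes of Figure 1; the helper `pad_tags` and the constant names are illustrative and not taken from the study's code.

```python
TOTAL_SENTENCES = 8_500
N_FOLDS = 10
VALIDATION_PERCENT = 20
MAX_LEN = 90  # assumed from the (None, 90) shapes in Figure 1

test_size = TOTAL_SENTENCES // N_FOLDS                    # one subset per fold
train_size = TOTAL_SENTENCES - test_size                  # the other nine subsets
validation_size = train_size * VALIDATION_PERCENT // 100  # held out during training
fit_size = train_size - validation_size


def pad_tags(tags, max_len=MAX_LEN, pad_value="O"):
    """Post-pad (or truncate) a tag sequence to exactly max_len entries."""
    return (tags + [pad_value] * max_len)[:max_len]


padded_train_tokens = train_size * MAX_LEN
padded_test_tokens = test_size * MAX_LEN

print(test_size, train_size, validation_size, fit_size)
print(padded_train_tokens, padded_test_tokens)
print(pad_tags(["B-NATURAL", "I-NATURAL"], max_len=5))
```

Under this assumption, the padded token counts are 688,500 for the train data and 76,500 for the test data, which are close to the column totals of Table 2 (688,496) and Table 3 (76,498).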


The input data were classified into predefined categories by combining two methods, BiLSTM and 
CRF. BiLSTM is made up of two LSTMs, forward LSTM and backward LSTM. As a modification of RNN, 
LSTM has a memory cell that can store information for a long period. When dealing with long sequential 
data, the vanishing gradient problem encountered in RNN can be addressed with LSTM by utilizing gates 
that control the information entering the memory [27]. 

Sequence labeling benefits from information on both sides of each token. The hidden state of a 
unidirectional LSTM, however, only carries information from the preceding part of the sequence, leaving 
the following part unknown. BiLSTM can be applied to overcome this limitation [28]. In BiLSTM, the 
combination of a forward LSTM and a backward LSTM captures information from both directions. The 
outputs of the forward and backward LSTMs are then combined using a merge function $\sigma$, as shown 
in (1). This function can be a concatenation, an addition, an average, or a multiplication. Here $y_t$ 
represents the output at time $t$, $\overrightarrow{h_t}$ represents the hidden state from the forward 
layer, and $\overleftarrow{h_t}$ represents the hidden state from the backward layer. 

$$ y_t = \sigma\left(\overrightarrow{h_t}, \overleftarrow{h_t}\right) \tag{1} $$
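
The four options for the merge function in (1) can be written out directly. The sketch below is illustrative: the function `merge_states`, the mode names, and the two-element toy vectors are assumptions. The paper does not state which option was used, although the BiLSTM output width of 256 in Figure 1 is twice the 128 LSTM units, which would be consistent with concatenation.

```python
def merge_states(forward, backward, mode="concat"):
    """Combine forward and backward hidden states for one time step."""
    if mode == "concat":
        return forward + backward                                  # width doubles
    if mode == "sum":
        return [f + b for f, b in zip(forward, backward)]
    if mode == "ave":
        return [(f + b) / 2 for f, b in zip(forward, backward)]
    if mode == "mul":
        return [f * b for f, b in zip(forward, backward)]
    raise ValueError(f"unknown merge mode: {mode}")


h_forward = [1.0, 2.0]   # toy forward hidden state
h_backward = [3.0, 4.0]  # toy backward hidden state
for mode in ("concat", "sum", "ave", "mul"):
    print(mode, merge_states(h_forward, h_backward, mode))
```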


The proposed BiLSTM-CRF architecture in this research is depicted in Figure 1. The input layer 
receives each sentence as a sequence of integer word indices. The embedding layer maps each index to its 
word vector, using the pre-trained Word2vec vectors as weights, and its output serves as the input for 
the following BiLSTM layer. A decision function based on the CRF layer is then used to generate the 
label sequence. CRF is a method that obtains globally optimal predictions using a conditional probability 
distribution model [29]. The CRF layer labels the sequence using the surrounding labels: the labels 
preceding and following the current word help predict the label of the current word. There are two 
types of scores in the CRF method, emission scores and transition scores. The emission scores in this 
model are taken from the output score matrix of the preceding layer. The transition scores are initially 
assigned randomly and are updated throughout the training process. Both scores are used to 
predict the final output sequence of labels. 
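
How emission and transition scores jointly determine the output sequence can be illustrated with Viterbi decoding, the standard decoding method for linear-chain CRFs. The paper does not state which decoding algorithm its CRF layer uses, so this is a sketch under that assumption; `viterbi_decode`, the reduced three-label set, and all score values are toy values chosen for illustration.

```python
def viterbi_decode(emissions, transitions, labels):
    """Return the label sequence with the highest total emission + transition score."""
    n_labels = len(labels)
    scores = list(emissions[0])          # best score of any path ending in each label
    backpointers = []
    for emission in emissions[1:]:
        step_scores, step_back = [], []
        for cur in range(n_labels):
            best_prev = max(range(n_labels),
                            key=lambda prev: scores[prev] + transitions[prev][cur])
            step_scores.append(scores[best_prev] + transitions[best_prev][cur] + emission[cur])
            step_back.append(best_prev)
        scores = step_scores
        backpointers.append(step_back)
    best = max(range(n_labels), key=lambda i: scores[i])
    path = [best]
    for step_back in reversed(backpointers):   # follow the backpointers to the start
        best = step_back[best]
        path.append(best)
    return [labels[i] for i in reversed(path)]


labels = ["O", "B-HERITAGE", "I-HERITAGE"]
# Toy emission scores for the tokens "Sensoji", "Temple", "in" (one row per token).
emissions = [[1.0, 0.8, 0.0], [0.0, 0.1, 2.0], [2.0, 0.0, 0.0]]
# transitions[prev][cur]; the O -> I-HERITAGE transition is heavily penalized.
transitions = [[0.5, 0.0, -10.0], [0.0, -1.0, 1.0], [0.0, -1.0, 0.5]]

greedy = [labels[max(range(3), key=lambda i: row[i])] for row in emissions]
decoded = viterbi_decode(emissions, transitions, labels)
print(greedy)   # per-token argmax ignores label dependencies
print(decoded)  # decoding with transition scores
```

With these toy scores, picking the best label per token yields an I-HERITAGE tag that directly follows O, whereas decoding with the transition scores returns a sequence that starts the entity with B-HERITAGE.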


2.6. Evaluation 

To measure the performance of NER, [15] argues that the F1-Score is more suitable than accuracy. 
Most tokens in NER data are labeled O, meaning that they are not part of a named entity, so a high 
accuracy can be obtained even when few entities are recognized. Therefore, this study uses the F1-Score 
to measure model performance. The F1-Score is the harmonic mean of precision and recall, as shown in (2). 
The best score that the F1-Score can achieve is 1, while the worst is 0. This value can also be expressed as 
a percentage ranging from 0 to 100%, which is the form used in this study. 

$$ \mathrm{F1} = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{2} $$
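
The argument for preferring the F1-Score over accuracy can be made concrete with the token counts from Table 1. The sketch below considers a hypothetical baseline that tags every token as O; the function `f1_score`, the convention of returning 0 when precision and recall are both 0, and the baseline itself are assumptions for illustration and are not results from the study.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Token counts from Table 1.
total_tokens = 183_507
outside_tokens = 171_728

# Hypothetical baseline: every token is tagged O.
accuracy_all_outside = outside_tokens / total_tokens
# No entity token is ever predicted, so precision and recall on the
# entity labels are taken as 0 (assumed zero-division convention).
f1_all_outside = f1_score(0.0, 0.0)

print(f"accuracy: {accuracy_all_outside:.2%}, entity F1: {f1_all_outside:.2f}")
```

Such a baseline would reach an accuracy of roughly 93.6% while recognizing no tourist attraction at all, which is why accuracy alone says little about NER quality on this dataset.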


[Figure 1: layer diagram. embedding_1_input (InputLayer): input (None, 90); embedding_1 (Embedding): output (None, 90, 100); bidirectional_1(lstm_1) (Bidirectional(LSTM)): input (None, 90, 100), output (None, 90, 256); time_distributed_1(dense_1) (TimeDistributed(Dense)): input (None, 90, 256), output (None, 90, 7); crf_1 (CRF): input (None, 90, 7), output (None, 90, 7)]


Figure 1. BiLSTM-CRF architecture 


3. RESULTS AND DISCUSSION 

In this study, we varied the hyperparameters to obtain the best configuration. The initial 
configuration used the Skip-gram algorithm for Word2Vec, 128 LSTM units, a dropout of 0.5 in the 
LSTM layer, TanH as the activation function in the dense layer, the Adam optimization function, a batch 
size of 32, and 30 epochs. To get the best model performance for NER in the tourism domain, we defined 
seven scenarios, as shown in Table 4. We conducted scenarios 1 to 7 sequentially, and the configuration 
that produced the best performance in each scenario was used in the subsequent scenarios. 
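
This sequential procedure amounts to a greedy search that tunes one hyperparameter at a time. The sketch below illustrates the idea with only two of the seven scenarios; `sequential_search` and `toy_evaluate` are hypothetical names, and the placeholder scoring function stands in for training the model and measuring its average F1-Score, so its values are not results from the study.

```python
def sequential_search(initial_config, scenarios, evaluate):
    """Tune one hyperparameter per scenario, carrying the best value forward."""
    config = dict(initial_config)
    for name, candidates in scenarios:
        scored = [(evaluate({**config, name: value}), value) for value in candidates]
        _, best_value = max(scored, key=lambda pair: pair[0])
        config[name] = best_value      # later scenarios reuse this choice
    return config


def toy_evaluate(config):
    """Placeholder score; a real run would train the model and return its F1-Score."""
    learning_rate_penalty = abs(config["learning_rate"] - 0.001) * 100
    epoch_bonus = 1.0 if config["epochs"] == 50 else 0.0
    return epoch_bonus - learning_rate_penalty


initial_config = {"learning_rate": 0.01, "epochs": 30}
scenarios = [
    ("learning_rate", [0.01, 0.001, 0.0001]),
    ("epochs", [30, 50]),
]
best_config = sequential_search(initial_config, scenarios, toy_evaluate)
print(best_config)
```

A greedy search of this kind evaluates far fewer configurations than a full grid, at the cost of not exploring interactions between hyperparameters that are tuned in different scenarios.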


Table 4. Hyperparameter setting 


Scenario  Hyperparameter                     Configuration 
1         Learning rate                      0.01 
                                             0.001 
                                             0.0001 
2         Word2Vec algorithm                 Skip-gram 
                                             CBOW 
3         Dense layer’s activation function  TanH 
                                             ReLU 
                                             Linear 
4         Batch size                         32 
                                             64 
5         LSTM unit                          100 
                                             128 
6         Epoch                              30 
                                             50 
7         Optimization function              Adam 
                                             Nadam 


Across all scenarios, the best model obtained an average F1-Score of 75.25% with the 
configuration shown in Table 5. The accuracy and average F1-Score generated for each scenario are 
shown in Table 6, and Table 7 presents the best F1-Score for each type of 
tourist attraction. The best model was then used to detect named entities in new data. The detected 
entities are tourist attractions classified into three categories: heritage attraction, purposefully built 
(man-made) attraction, and natural attraction. Words that are not tourist attractions are assigned to the 
outside category. 

Figure 2 shows examples of the named entity detection results from our dataset. Our best model 
detected some named entities correctly, but it still made mistakes when predicting labels. For example, 
several types of tourist attractions were still detected incorrectly. These mistakes may be caused by the 
small number of tourist attraction examples in our dataset. 


Table 5. Hyperparameter setting for the best model 


Hyperparameter                     Configuration 
Learning rate                      0.001 
Word2Vec algorithm                 Skip-gram 
Dense layer’s activation function  Linear 
Batch size                         32 
LSTM unit                          128 
Epoch                              50 
Optimization function              Adam 


Table 6. Accuracy and average F1-Score for each scenario 


Scenario  Hyperparameter         Configuration  Training      Validation    Testing       Average 
                                                Accuracy (%)  Accuracy (%)  Accuracy (%)  F1-Score (%) 
1         Learning rate          0.01           98.51         98.52         98.55         14.27 
                                 0.001          99.67         98.80         99.43         68.68 
                                 0.0001         98.97         98.80         98.94         38.83 
2         Word2Vec algorithm     Skip-gram      99.67         98.80         99.43         68.68 
                                 CBOW           99.50         98.48         99.14         53.01 
3         Dense layer’s          TanH           99.67         98.80         99.43         68.68 
          activation function    Linear         99.71         98.79         99.46         71.66 
                                 ReLU           99.37         98.7          99.19         54.45 
4         Batch size             32             99.71         98.79         99.46         71.66 
                                 64             99.61         98.79         99.37         67 
5         LSTM unit              100            99.56         98.77         99.34         66 
                                 128            99.71         98.79         99.46         71.66 
6         Epoch                  30             99.71         98.79         99.46         71.66 
                                 50             99.82         98.73         99.52         75.25 
7         Optimization function  Adam           99.82         98.73         99.52         75.25 
                                 Nadam          99.82         98.79         99.51         74.61 


Table 7. The best F1-Score for each type of tourist attraction 


Label     F1-Score 
          Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Fold 7  Fold 8  Fold 9  Fold 10  Average 
HERITAGE  0.45    0.40    0.60    0.76    0.82    0.85    0.86    0.79    0.83    0.76     0.712 
NATURAL   0.47    0.58    0.68    0.79    0.86    0.87    0.85    0.86    0.87    0.89     0.772 
PURPOSE   0.48    0.54    0.66    0.77    0.85    0.88    0.85    0.84    0.90    0.89     0.766 

[Figure 2: three columns of token-label predictions, for example "Imperial: B-PURPOSE", "Palace: I-PURPOSE", "Edo: B-HERITAGE", "Castle: I-HERITAGE", "Park: I-PURPOSE", "Sensoji: O", "Temple: I-HERITAGE", "Tokyo: O"; the remaining tokens are labeled O]


Figure 2. The example of entity detection results 


Bidirectional long-short term memory and conditional random field for tourism named... (Annisa Zahra) 


1276 ISSN: 2252-8938 


4. CONCLUSION 

In this study, a named entity recognition system for tourism has been presented. We focused on 
extracting tourism entities based on their categories: natural attraction, heritage attraction, and purposefully 
built (artificial) attraction. The experiments have shown that the proposed BiLSTM-CRF algorithm has 
demonstrated promising results in identifying named entities from the tourism dataset. We also found that the 
application of word2vec with skip-gram can improve the performance of the named entity system. This 
research has produced a model that could predict new data quite well, but there were still some mistakes in 
the detection. We experimented with various scenarios, and the best model produced an average F1-Score of 
75.25%. For future work, other word representation models can be considered to improve the 
performance of named entity detection. In addition, we suggest adding more entity labels, such as country, 
city location, and tourism name. 